Parameterised Reports With RMarkdown and Purrr

Ayush Patel

At the RLadies Bangalore Meetup

28-Aug-2022

About Me

  • I am an economist working at the intersection of data, public policy, and development.
  • I am also an RStudio (Posit) Certified Tidyverse Instructor.

Reach Me

@ayushbipinpatel
ayush.ap58@gmail.com
@AyushBipinPatel

Pre-requisite (This is all you need.)


1. Basics of Rmarkdown

  • Adding and executing code chunks
  • Basic Markdown Syntax (Bold, Headings, etc …)
  • Knitting a document

Advanced RMarkdown knowledge is not required.

2. Fundamentals of Data Wrangling

Can you anticipate the output of this code?

penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarise(
    median_body_mass = median(body_mass_g, na.rm = TRUE)
  )

Required Packages


Essentials:

  • rmarkdown
  • purrr
  • quarto

Choices based on preference:

  • dplyr
  • ggplot2
  • highcharter
  • readr

The Two Key Concepts to learn


Creating Wrapper Function

Modifying an existing function to efficiently fit with a particular use case.

Using purrr::map() family functions

Instead of calling the wrapper function multiple times with different inputs, use map family functions to apply the wrapper function to the desired sequence of inputs.

Wrapping a Function - First Key


Three things to keep in mind:

  • The primary function to wrap
  • Desired Output
  • All the required inputs

How to wrap a function


This means you will essentially create your own function. Consider this use case.

I want to show that as the sample size increases, the sample mean gets closer to the population mean.

Primary function to wrap. rnorm() can be used to draw a random sample from a population with a desired mean and standard deviation.

Desired Output. We want random numbers from a population with a fixed mean and fixed SD, but with a different number of observations each time.

Required Inputs. We have three inputs: the fixed mean, the fixed SD, and the number of observations.

How to wrap a function - Primary function to wrap

fixed_mean <- 8
fixed_sd <- 2

vec_norm <- rnorm(n = 10,mean = fixed_mean,sd = fixed_sd) # 3 inputs

vec_norm
 [1]  6.948382 10.505596  5.740777  7.999065  7.502900  8.771539  5.163323
 [8]  7.480460  5.600220  4.794188
mean(vec_norm)
[1] 7.050645
wrap_rnorm <- function(pass_n_value){
  rnorm(n = pass_n_value ,mean = fixed_mean,sd = fixed_sd)
}

vec_norm2 <- wrap_rnorm(pass_n_value = 20)

vec_norm2
 [1]  9.630968  5.959133  9.633806  8.150890  5.915478  6.755224 10.735542
 [8]  6.551124  7.989904  8.981636  7.811034 11.312163  8.916044  9.189349
[15]  9.701936  5.733621  6.905854 10.815325 11.259237  8.219931
mean(vec_norm2)
[1] 8.50841

We have a wrapper function

This function, wrap_rnorm(), takes one value (say, n): the number of observations. It then randomly generates n observations from a population with a fixed mean and fixed SD.

wrap_rnorm(pass_n_value = 10)
 [1] 12.618119  9.223980  8.585175  8.389362  7.264103  6.133763  6.518785
 [8]  7.376522  7.864373  8.365424
wrap_rnorm(pass_n_value = 20)
 [1]  6.387129  8.124797  7.891840  7.704465 10.576320  9.737834  8.876747
 [8]  5.373789  7.548953 10.297046  6.845207 11.648763 10.734660  7.957484
[15]  8.517344  6.920973  5.107091  7.952919  7.081092  7.019311
wrap_rnorm(pass_n_value = 30)
 [1]  6.560223  8.048634 12.202200 10.293537  7.065803  7.125337  7.987380
 [8]  8.831583  3.588277  2.772305  8.413338  5.130236  3.962421  8.199020
[15]  9.129445  5.766644  5.534138  7.345632  9.517989 11.373459 10.362150
[22] 11.731038  9.502530  9.637268  5.814128  6.854545  8.256952  6.914727
[29]  8.316611  8.014058

But typing this multiple times, or copy-pasting it more than twice, is not ideal.

This is where purrr::map() can help us

Using purrr::map() family - The Second Key

wrap_rnorm(pass_n_value = 10)
 [1] 9.277246 9.641011 8.223049 6.837769 6.293026 8.419360 4.221592 5.978006
 [9] 8.291693 4.890595
wrap_rnorm(pass_n_value = 20)
 [1] 10.662691  6.660318  8.136084  8.273103  4.673990  4.073667  6.608768
 [8] 10.553002  5.648907 10.678810  9.810218  7.450201 12.246437 10.078057
[15]  9.826700 10.830182  6.744612  8.320719  7.824346  9.483533
wrap_rnorm(pass_n_value = 30)
 [1]  7.544110  7.957464 10.360584  6.458247  7.957659  9.231404  7.645065
 [8]  4.849479  3.698372  9.564612  9.379014  5.769380  8.279075 10.821331
[15]  9.012934  8.599845  5.697840  7.351044 13.253735  7.386874  7.218736
[22]  8.535865 10.349491  8.745315  9.859614  6.145464  8.662354  8.967474
[29]  8.149633  7.770852
purrr::map(.x = c(10,20,30),.f = wrap_rnorm)
[[1]]
 [1] 10.816358  8.185007  7.427629  9.701427  9.765678  5.455677  5.022867
 [8]  8.137811  6.294769 12.163303

[[2]]
 [1]  8.874205  6.458478  8.397012  7.675455  8.285075  9.248115  6.769382
 [8]  8.274423  8.479237  7.037915  7.463629  9.427760 10.834785  9.288717
[15]  7.906149  9.209483  5.114173  6.765709  6.464328  5.945079

[[3]]
 [1] 6.499760 9.197877 8.827299 6.373612 6.428216 9.472122 7.703281 6.264655
 [9] 6.074504 6.371418 7.609697 7.992328 8.772184 5.284319 8.668518 6.659431
[17] 8.246815 8.134401 7.870927 8.885915 5.479374 9.391986 8.595907 9.444719
[25] 8.359750 9.802525 9.855422 8.697819 6.230392 6.811421
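
To return to the original use case, here is a minimal sketch that maps the wrapper over increasing sample sizes and keeps only the sample means. It assumes a variant of wrap_rnorm() that takes the mean and SD as default arguments instead of reading them from the global environment; the seed and the sample sizes are arbitrary choices for illustration.

```r
library(purrr)

# Variant of the wrapper: mean and SD are default arguments (an assumption,
# the version on the slides reads fixed_mean and fixed_sd from the workspace).
wrap_rnorm <- function(pass_n_value, fixed_mean = 8, fixed_sd = 2) {
  rnorm(n = pass_n_value, mean = fixed_mean, sd = fixed_sd)
}

set.seed(42)  # arbitrary seed, only to make the run repeatable

sample_sizes <- c(10, 100, 1000, 100000)

# map_dbl() returns a plain numeric vector instead of a list
sample_means <- map_dbl(sample_sizes, function(n) mean(wrap_rnorm(n)))

# Distance of each sample mean from the population mean of 8
abs(sample_means - 8)
```

The distances tend to shrink as the sample size grows, although any single run need not be strictly decreasing.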

What if the function takes more than one argument?

The Second Key - More than one argument

rnorm(n = 10,mean = 100,sd = 5)
 [1]  99.79911  94.50820 101.86479  97.39179 100.61663 103.46590  92.10241
 [8]  98.48696 100.98396 109.25042
rnorm(n = 5,mean = 0.654,sd = 51)
[1] -58.716636 -15.598206   8.496584 -31.628217 -16.963207
rnorm(n = 15,mean = -54,sd = 25)
 [1] -75.895778 -59.308389 -28.836774 -48.361434 -45.958544 -39.201747
 [7] -35.239147 -48.500105 -38.126520 -25.058001 -50.352329 -74.676758
[13] -50.054358  -8.709084 -64.967992
purrr::pmap(.l = list(
  n = c(10,5,15),
  mean = c(100,0.654,-54),
  sd = c(5,51,25)
),.f = rnorm)
[[1]]
 [1] 100.19341  98.48849 101.88900  93.10251  98.71095 101.92998 108.96508
 [8]  98.74750  99.33220  97.75242

[[2]]
[1] -85.03220  68.35520 -41.19661  38.62527  40.88699

[[3]]
 [1] -85.491147 -70.807020 -75.356104 -76.513720 -63.172638 -65.915728
 [7] -11.590671 -28.690708 -19.321058   5.820102 -53.210837 -10.895302
[13] -36.859627 -46.937393 -57.539149
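
pmap() matches the names in .l to the argument names of .f, so the inputs can also be kept in a data frame with one row per call. The sketch below reuses the values from the slide; the object names and the seed are assumptions for illustration. It also shows map2(), which is enough when exactly two inputs vary.

```r
library(purrr)

# One row per call; column names match the arguments of rnorm()
inputs <- data.frame(
  n = c(10, 5, 15),
  mean = c(100, 0.654, -54),
  sd = c(5, 51, 25)
)

set.seed(2022)  # arbitrary seed

# A data frame is a list of columns, so it can be passed to pmap() directly
samples <- pmap(inputs, rnorm)

# Number of observations drawn in each call
lengths(samples)

# With exactly two varying inputs, map2() does the same job
samples_two <- map2(
  .x = c(10, 5),
  .y = c(100, -54),
  .f = function(n, mean) rnorm(n = n, mean = mean, sd = 2)
)
```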

Ready to Report

We now move on to parameterised reports.

Parameterised Reports

  • Why Parameterised reports?
  • What are parameters?
  • How to use these in .rmd files?

Answering these three questions will provide a strong intuition about Parameterised Reports.

Why Parameterised reports?

  • .rmd creates reproducible and easy to iterate documents/reports
  • Adding parameters takes it one step further: a single .rmd file can generate multiple reports, one for each set of parameter values.
  • Saves time.
  • Easy to update.

What are parameters?

IT IS JUST A FANCY NAME FOR VALUES.

I think of parameters as values that an .rmd file assumes before it is rendered/knitted.

These are declared in the YAML header of the .rmd file.

A single .rmd file can have one or more parameters.

---
title: My Document
output: html_document
params:
  year: 2018
  region: Europe
  printcode: TRUE
  date: !r Sys.Date()
---

These parameters, declared in the YAML header, can then be accessed anywhere in the .rmd file through the params list.

params$year
params$region
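
As a rough sketch of how a chunk might use these values: the code below filters a data frame with params$year and params$region. Inside a knitted document rmarkdown creates params from the YAML header; here it is built by hand so the snippet runs on its own. The sales data frame and its columns are hypothetical.

```r
# Stand-in for the params list that rmarkdown builds from the YAML header
params <- list(year = 2018, region = "Europe", printcode = TRUE)

# Hypothetical data; a real report would load its prepared data instead
sales <- data.frame(
  year = c(2017, 2018, 2018, 2018),
  region = c("Europe", "Europe", "Asia", "Europe"),
  amount = c(10, 20, 30, 40)
)

# Keep only the rows that match the current parameter values
report_data <- subset(sales, year == params$year & region == params$region)
report_data

# Parameters can also feed titles and captions
paste("Report for", params$region, params$year)
```

Changing the values in params changes the rows that reach the rest of the report, which is all a parameterised report relies on.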

Knit with parameters

With the Knit button and changes in the YAML header

Change the params in the YAML header as needed and press the Knit button.

---
title: My Document
output: html_document
params:
  year: 2018 # change values here and press knit button
  region: Europe # change values here and press knit button
  printcode: TRUE # change values here and press knit button
  date: !r Sys.Date() # change values here and press knit button
---

Using the render function

rmarkdown::render(input = "MyDocument.Rmd", 
                  params = list(
                    year = 2017,
                    region = "Asia" # can change all or some parameters
))

Components of generating parameterised reports

  • Project and Directory Structure (step 0)
  • Decide on the contents of the reports (step 1)
  • Complete all data prep outside .rmd (step 2)
  • Write report structure in .rmd (step 3)
  • Create script to generate all reports (step 4)

Project and Directory Structure (step 0)

All the files should be contained in a project.

I prefer this structure. This is opinionated, so feel free to deviate from it.

Directory tree
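
A possible layout can be set up from R as in the sketch below. The scripts and data_prepared folders are the ones referred to in later steps; data_raw, reports, and the project name are assumptions for illustration. The project is created under tempdir() so the snippet does not touch your working directory.

```r
# Hypothetical project root inside the session's temporary directory
project_root <- file.path(tempdir(), "district_reports")

# scripts and data_prepared appear in the talk; the other two are assumed
folders <- c("data_raw", "data_prepared", "scripts", "reports")

for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

list.files(project_root)
```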

A Simple Example of the entire process.

From here on forward, I shall complement the slides with an example I have created for generating parameterised reports with Rmarkdown and Purrr.

The github repository for this can be accessed here.

The final output and explanation can be accessed here. This is written so that it can be used as a standalone resource.

Deciding on the contents of the reports (step 1)

I have village amenities data from the Indian census 2011, for Gujarat state.

I want a report at district level. This means every district will have its own report. This is also where we decide on the parameters that we shall need.

The report should have the following:

  • Wikipedia search results of the district
  • summary statistics of demography of the district
  • visualization for geographical area and population of villages
  • visualizations for net sown area and irrigated area of villages

Data prep (step 2)

Create a separate R script that cleans, wrangles, and makes all necessary changes to the raw data (script_clean_raw_data.R in the scripts folder).

Save the prepared data in an appropriate location (the data_prepared folder).

Write the report structure in .rmd (step 3)

All the reports will be generated from the structure defined by this .rmd file.

Declare in this .rmd file all the parameters that were decided on in step 1.

The analysis flow is carried out in this file.

I suggest writing this .rmd file with some concrete values for the params in mind. This makes it easier to implement the analysis flow.

This .rmd file can be stored in the scripts folder.

Script to generate all reports: One ring to rule them all (step 4)

Create an R script for functionally generating multiple parameterised reports.

In this script, create a wrapper function around the rmarkdown::render() function.

Once this function is created, create vectors/lists, one for each param, containing the sequence of values to be passed to that parameter.

Use the appropriate {purrr} function (pmap is the way to go if there are two or more params) to apply the wrapper function over the vectors/lists of param inputs. This will generate all your reports and save them in the location specified.
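
The steps above can be sketched as follows. This is a minimal illustration, not the script from the example repository: the template, the district and year params, the file names, and the folder layout are all assumptions. A tiny template is written to a temporary folder so the snippet is self-contained, and pwalk() is used instead of pmap() because the reports are wanted only for their side effect of being written to disk.

```r
library(purrr)

# Hypothetical working folder and template, created so the sketch runs alone
work_dir <- file.path(tempdir(), "param_reports_demo")
dir.create(file.path(work_dir, "reports"),
           recursive = TRUE, showWarnings = FALSE)
template <- file.path(work_dir, "report_template.Rmd")
writeLines(c(
  "---",
  "title: District report",
  "output: html_document",
  "params:",
  "  district: Kachchh",
  "  year: 2011",
  "---",
  "",
  "This report covers `r params$district` for `r params$year`."
), template)

# Wrapper around rmarkdown::render(): one call renders one report
render_district_report <- function(district, year) {
  rmarkdown::render(
    input = template,
    params = list(district = district, year = year),
    output_file = paste0("report_", district, "_", year, ".html"),
    output_dir = file.path(work_dir, "reports"),
    quiet = TRUE
  )
}

# One vector per param; values at the same position belong to one report
param_inputs <- list(
  district = c("Kachchh", "Surat", "Anand"),  # illustrative values
  year = c(2011, 2011, 2011)
)

# Rendering needs the rmarkdown package and pandoc, so skip it otherwise
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  pwalk(param_inputs, render_district_report)
}

list.files(file.path(work_dir, "reports"))
```

Giving each report its own output_file matters here: without it, every call would write to the same file name and overwrite the previous report.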

Quarto

The same process can be followed to create parameterised reports with .qmd files.

There is one major difference: instead of rmarkdown::render() we need to use quarto::quarto_render(). Note that quarto_render() does not have an output_dir argument, so all reports from the .qmd file are generated in the same directory as the .qmd file.
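
A rough Quarto counterpart of the wrapper could look like the sketch below. The template path, the district param, and the output file names are hypothetical; parameter values are passed through the execute_params argument of quarto_render(). The render step runs only if the template, the quarto package, and the Quarto command line tool are all present.

```r
library(purrr)

# Hypothetical template path; point this at your own parameterised .qmd file
qmd_template <- "report_template.qmd"

# Wrapper around quarto::quarto_render(): one call renders one report
render_district_qmd <- function(district) {
  quarto::quarto_render(
    input = qmd_template,
    execute_params = list(district = district),
    output_file = paste0("report_", district, ".html")
  )
}

districts <- c("Kachchh", "Surat", "Anand")  # illustrative values

# Render only when the template, the package and the Quarto CLI are available
can_render <- file.exists(qmd_template) &&
  requireNamespace("quarto", quietly = TRUE) &&
  !is.null(quarto::quarto_path())

if (can_render) walk(districts, render_district_qmd)
```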

Acknowledgements and references

Chapter 15 in R Markdown: The Definitive Guide

{purrr}

Chapter 19 in R4DS

Tom Mock for Get Started with Quarto

Slidecraft by Emil Hvitfeldt